Efficient Cluster Representation in Similar Document Search
Abstract
Similar document search is the problem of retrieving documents that resemble a given document. In this paper, we describe a cluster-based retrieval scheme that approximates the classic nearest neighbor search scheme, by identifying the clusters that are closest to the input document and restricting attention to these clusters only. Cluster signatures play an important role in the effectiveness of this approximation, since the inclusion of a cluster in the restricted search depends entirely on whether its signature matches the given document. We study three different representations of cluster signatures and their role in performing a similar document search, while examining only a fraction of the documents from the target corpus.
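The retrieval scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes bag-of-words term-frequency vectors, cosine similarity, and a summed-centroid cluster signature. The abstract does not specify the three signature representations studied, so the centroid is only a stand-in. All names (`vectorize`, `cosine`, `centroid_signature`, `cluster_search`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid_signature(vectors):
    """Assumed signature: the sum of the member vectors.
    Cosine similarity is scale-invariant, so the sum ranks like the mean."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def cluster_search(query, clusters, docs, n_clusters=1, top_k=3):
    """Approximate nearest neighbor search restricted to the closest clusters.

    Returns the top_k document ids and the number of documents examined.
    """
    q = vectorize(query)
    # Step 1: score each cluster by how well its signature matches the query.
    signatures = {
        cid: centroid_signature([vectorize(docs[d]) for d in members])
        for cid, members in clusters.items()
    }
    ranked = sorted(clusters, key=lambda cid: cosine(q, signatures[cid]), reverse=True)
    # Step 2: restrict attention to documents in the best-matching clusters.
    candidates = [d for cid in ranked[:n_clusters] for d in clusters[cid]]
    # Step 3: exhaustive nearest neighbor search over the candidates only.
    scored = sorted(candidates, key=lambda d: cosine(q, vectorize(docs[d])), reverse=True)
    return scored[:top_k], len(candidates)


# Toy corpus with a hand-made clustering (illustrative data only).
docs = {
    "d1": "cats and dogs are pets",
    "d2": "dogs chase cats",
    "d3": "stocks and bonds are investments",
    "d4": "bonds pay interest to investors",
}
clusters = {"pets": ["d1", "d2"], "finance": ["d3", "d4"]}

results, examined = cluster_search("my dogs and cats", clusters, docs)
print(results, "examined", examined, "of", len(docs), "documents")
```

In this sketch the signatures are recomputed on every call for brevity; a practical system would presumably precompute and store them. Raising `n_clusters` trades more examined documents for a closer approximation of the exhaustive search.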
Similar resources
Efficient document retrieval using text clustering
Similar document retrieval is the problem of finding documents that are most similar to a given query document. In this work, we present a retrieval based on clustering of the documents that approximates the nearest neighbor search. It is done by determining the clusters that are most similar to the query document and restricting the search to the documents in these clusters. Cluster representa...
Document Clustering: A Detailed Review
Document clustering is the automatic organization of documents into clusters so that documents within a cluster have high similarity in comparison to documents in other clusters. It has been studied intensively because of its wide applicability in various areas such as web mining, search engines, and information retrieval. It is measuring similarity between documents and grouping similar documents ...
A word-based soft clustering algorithm for documents
Document clustering is an important tool for applications such as Web search engines. It enables the user to have a good overall view of the information contained in the documents. However, existing algorithms suffer from various aspects; hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorit...
Phrase based Clustering Scheme of Suffix Tree Document Clustering Model
Document clustering is a difficult and recent research field within search engine research. Most existing document clustering techniques use a group of keywords from each document to cluster the documents. Document clustering arises from information retrieval domains, and “It finds grouping for a set of documents belonging to the same cluster are similar and documents belongs ...
Using Web structure and summarisation techniques for Web content mining
The dynamic nature and size of the Internet can result in difficulty finding relevant information. Most users typically express their information need via short queries to search engines and they often have to physically sift through the search results based on relevance ranking set by the search engines, making the process of relevance judgement time-consuming. In this paper, we describe a nov...